@dnikolaev-amd dnikolaev-amd commented Jul 15, 2025

Skip test_nn.py::TestNN.test_batchnorm_3D_train_NCHW_vs_native_mixed_float16.
The test fails on the weight gradient comparison between MIOpen/cuDNN and Native batchnorm.

The corresponding CPU test, test_batchnorm_3D_train_NCHW_vs_cpu_mixed_float16, passes,
so this looks like an FP16 accuracy issue in Native batchnorm.
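
The root cause is not identified in this PR. As a generic illustration only (an assumption, not a diagnosis of Native batchnorm), the sketch below shows how a reduction that keeps its running total in half precision can drift far from a reference that accumulates in higher precision. The helper names and the input data are hypothetical; the rounding uses the standard library's `struct` half-precision format rather than any PyTorch kernel.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (stdlib only)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def sum_accumulating_in_fp16(values):
    """Sum values, rounding the running total to FP16 after every addition."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc


def sum_accumulating_in_fp64(values):
    """Sum the same FP16 inputs in double precision, rounding once at the end."""
    return to_fp16(sum(to_fp16(v) for v in values))


# Hypothetical data: 4096 identical terms standing in for one long reduction.
terms = [0.1] * 4096
low_precision = sum_accumulating_in_fp16(terms)
reference = sum_accumulating_in_fp64(terms)
print(low_precision, reference)
```

With these inputs the FP16 running total stalls at 256.0, because each further term is smaller than half the spacing between representable values there, while the reference sum is 409.5.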

The test fails on MI200/MI300 and V100.
It passes on Navi (with MIOpen enabled), for reasons not yet understood.

Fixes SWDEV-541024, SWDEV-539171

python test_nn.py -v -k test_batchnorm_3D_train_NCHW_vs_native_mixed_float16

test_batchnorm_3D_train_NCHW_vs_native_mixed_float16 (__main__.TestNN) ... skipped '3D float16 NCHW train failed on CUDA and ROCm due to Native batchnorm accuracy issue SWDEV-541024'

OK (skipped=1)
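
For readers unfamiliar with how such a result is produced, here is a minimal sketch using only the standard `unittest` module. It assumes an unconditional `unittest.skip` decorator; the actual change in test_nn.py is not shown in this conversation and may use a different mechanism. The class below is a hypothetical stand-in that reuses the test name and the skip reason from the output above.

```python
import io
import unittest

# Reason string copied from the test output quoted above.
SKIP_REASON = (
    "3D float16 NCHW train failed on CUDA and ROCm due to "
    "Native batchnorm accuracy issue SWDEV-541024"
)


class TestNN(unittest.TestCase):
    # Hypothetical stand-in for the real test in test_nn.py.
    @unittest.skip(SKIP_REASON)
    def test_batchnorm_3D_train_NCHW_vs_native_mixed_float16(self):
        self.fail("never reached: the decorator skips the test before it runs")


def run_demo():
    """Run the stand-in test case and return the result plus the text report."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestNN)
    stream = io.StringIO()
    result = unittest.TextTestRunner(stream=stream, verbosity=2).run(suite)
    return result, stream.getvalue()


result, report = run_demo()
print(report)
```

A skipped test is counted separately from failures, so the run still ends with an OK status.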

Cherry-picked to release/2.7 branch via #2390

Cherry-picked to release/2.6 branch via #2391

Cherry-picked to release/2.8 branch via #2652

Cherry-picked to release/2.9 branch via #2788


rocm-repo-management-api bot commented Jul 15, 2025

Jenkins build for 2f9e18c5fb255cbd1f554c070bcf6d852ab9b848 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@dnikolaev-amd dnikolaev-amd changed the title from "Skip 3D NCHW FP16 batchnorm test due to Native accuracy issue" to "[rocm7.0_internal_testing] skip 3D NCHW FP16 batchnorm test due to Native accuracy issue" Jul 15, 2025
@pruthvistony pruthvistony merged commit 4eaa5bf into rocm7.0_internal_testing Jul 19, 2025
0 of 4 checks passed
@pruthvistony pruthvistony deleted the skip_fp16_nchw_native_batchnorm_test branch July 19, 2025 05:26
@dnikolaev-amd
Author

! cherry-pick --onto release/2.7

@dnikolaev-amd
Author

! cherry-pick --onto release/2.6

okakarpa pushed a commit that referenced this pull request Jul 21, 2025
…tive accuracy issue (#2370)

Skip
`test_nn.py::TestNN.test_batchnorm_3D_train_NCHW_vs_native_mixed_float16`.
The test fails on the `weight gradient` comparison between MIOpen/cuDNN
and Native batchnorm.

The corresponding CPU test,
`test_batchnorm_3D_train_NCHW_vs_cpu_mixed_float16`, passes,
so this looks like an FP16 accuracy issue in Native batchnorm.

The test fails on MI200/MI300 and V100.
It passes on Navi (with MIOpen enabled), for reasons not yet understood.

Fixes SWDEV-541024, SWDEV-539171

```
python test_nn.py -v -k test_batchnorm_3D_train_NCHW_vs_native_mixed_float16

test_batchnorm_3D_train_NCHW_vs_native_mixed_float16 (__main__.TestNN) ... skipped '3D float16 NCHW train failed on CUDA and ROCm due to Native batchnorm accuracy issue SWDEV-541024'

OK (skipped=1)
```
@okakarpa
Collaborator

Created branch autogenerated/release/2.7_cherry-pick_pr-2370 and #2390

@okakarpa
Collaborator

Created branch autogenerated/release/2.6_cherry-pick_pr-2370 and #2391

jithunnair-amd pushed a commit that referenced this pull request Jul 24, 2025
… Native accuracy issue (#2391)

Cherry-pick of #2370

Co-authored-by: Dmitry Nikolaev <[email protected]>
jithunnair-amd pushed a commit that referenced this pull request Jul 24, 2025
… Native accuracy issue (#2390)

Cherry-pick of #2370

Co-authored-by: Dmitry Nikolaev <[email protected]>
pruthvistony pushed a commit that referenced this pull request Aug 2, 2025
#2440)

This PR fixes the P1 Jira
https://ontrack-internal.amd.com/browse/SWDEV-542659.
That Jira lists three test files with failing tests:
1) distributed.test_distributed_spawn
2) test_binary_ufuncs
3) test_nn

The test files **distributed.test_distributed_spawn** and
**test_binary_ufuncs** pass with the latest mainline build:

**registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16426_ubuntu22.04_py3.10_pytorch_lw_release-2.7_fe3d37a9**.

The test file **test_nn** has two failing tests:
**test_batchnorm_3D_train_NCHW_vs_native_mixed_float16** and
**test_RNN_dropout_state**.
The **test_batchnorm_3D_train_NCHW_vs_native_mixed_float16** test is
skipped by PR #2370.
The **test_RNN_dropout_state** test is fixed by cherry-picking upstream
commit 1aa971a.

Tested on MI200 with the Docker image:

**registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16426_ubuntu22.04_py3.10_pytorch_lw_release-2.7_fe3d37a9**.

---------

Co-authored-by: Iurii Paikov <[email protected]>
Co-authored-by: Jeff Daily <[email protected]>
Co-authored-by: Nikita Shulga <[email protected]>
dhonnappa-amd pushed a commit that referenced this pull request Aug 13, 2025
#2440)
@dnikolaev-amd
Author

! cherry-pick --onto release/2.8

dhonnappa-amd pushed a commit that referenced this pull request Sep 18, 2025
…tive accuracy issue (#2370)
@dhonnappa-amd

Created branch autogenerated/release/2.8_cherry-pick_pr-2370 and #2652

Comment processed by Build

dhonnappa-amd added a commit that referenced this pull request Sep 18, 2025
@dnikolaev-amd
Author

! cherry-pick --onto release/2.9

@rocm-repo-management-api

Created branch autogenerated/release/2.9_cherry-pick_pr-2370 and #2788. It contains a merge conflict. Please resolve it

Comment processed by Build

jeffdaily pushed a commit that referenced this pull request Nov 5, 2025
… Native accuracy issue (#2788)

Skip for `test_batchnorm_3D_train_NCHW_vs_native_mixed_float16`
Cherry-pick of #2370 
~Need to resolve conflicts~ - resolved

---------

Co-authored-by: Dmitry Nikolaev <[email protected]>
jeffdaily pushed a commit that referenced this pull request Nov 17, 2025
… Native accuracy issue (#2788)